ABSTRACT

This thesis introduces the concept of a distributed diagnosis algorithm in the context
of the Preparata-Metze-Chien (PMC) model. It presents a Computer-Aided-Design (CAD)
tool for use in analyzing such algorithms. That is, with this tool, the user can specify a
multiprocessor system and a set of test outcomes, and then analyze the properties of a
specified distributed diagnosis algorithm.

Examples in this thesis include systems in which:

1.  Correct  diagnosis  is  achieved  in  a  small  number  of  iterations. 

2.  Correct  diagnosis  is  never  achieved. 

3.  An  oscillating  situation  exists  in  which  faulty  processors  become  alternately 
enabled  and  disabled. 
I. INTRODUCTION

A.    NEED  FOR  STUDY 

The advent of inexpensive microprocessor elements has made multiprocessor
computing networks much more practical. This fact has led to an increasing interest in
the high reliability of such networks. The prospect of ultra reliability has inspired
research into the use of computers in applications where low reliability had previously
precluded their use. These include aircraft control systems, for which the Federal
Aviation Administration (FAA) has specified as a standard a probability of failure of
10^-9 in a 10-hour operating period [Ref. 1].

The traditional approach to computer reliability is through redundancy, where reliable
outputs are the result of a vote on three or more less reliable outputs. In the theory of
system diagnosis [Ref. 2], a graph is used to model a multiprocessing system: nodes
represent the processors and arcs represent tests between processors. One goal of the
theory is to determine which tests achieve the highest tolerance to faults. It has been
shown [Ref. 3] that, for the same system reliability, greater throughput can be achieved
with a system diagnosis approach than with modular redundancy. Conversely, for the same
throughput, a system diagnosis approach yields greater reliability [Ref. 3].
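As a minimal illustration of the voting step just described, the sketch below shows a strict-majority voter. It is not taken from the thesis: the function name `majority_vote` is hypothetical, and it assumes the redundant modules return directly comparable values and that a strict majority is required.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value reported by a strict majority of redundant modules,
    or None when no value has a strict majority."""
    if not outputs:
        raise ValueError("at least one module output is required")
    value, count = Counter(outputs).most_common(1)[0]
    return value if 2 * count > len(outputs) else None


# Triple modular redundancy: one of three modules returns a wrong result.
print(majority_vote([42, 42, 17]))  # 42
```

When no value has a strict majority the sketch returns None rather than guessing; that is one possible policy among several.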

Beginning with the Preparata-Metze-Chien model, many models have been developed
for system diagnosis. The best-known models are the following [Ref. 4]:

1. Preparata-Metze-Chien (PMC) model: This model was used in this research and is
described in Chapter II. It is represented by Ap in Table 1.1.

2. Perfect tester: In this model, test outcomes correspond to a perfect diagnosis of
faulty units. In other words, if the tested unit is faulty, the test outcome is fail (1)
no matter whether the testing unit is faulty or fault-free. If the tested unit is
fault-free, the test outcome is pass (0) regardless of the status of the testing unit.
This model is represented by Aa in Table 1.1.

3. 1-fail-safe tester: This model never produces an incorrect 0. That is, there may be
incorrect fail outcomes (e.g., when a faulty unit tests a fault-free unit, the test
outcome will be 1), but there will never be an incorrect pass (0) outcome. It is
represented by Aw in Table 1.1.

4. 0-fail-safe tester: This model never produces an incorrect 1. When a faulty unit tests
another faulty unit, the test outcome will be 0, which is an incorrect pass outcome;
however, there is never an incorrect fail outcome. The model is represented by Ay in
Table 1.1.

5. Ab is a model in which a faulty unit will never incorrectly diagnose another faulty
unit. However, in this model a faulty unit testing a fault-free unit produces 0 or 1
arbitrarily.

6. A p. is a model in which a faulty testing unit may not correctly diagnose another
faulty unit; such test outcomes can be 0 or 1 arbitrarily.

7. Ax is a model in which a faulty testing unit always diagnoses a fault-free unit
incorrectly, producing a fail test outcome. However, a faulty testing unit produces 0 or
1 arbitrarily for a faulty tested unit.

8. Partial tester: In this model, a fault-free testing unit may fail to correctly
diagnose a faulty unit. This model was examined by Simoncini and Friedman [Ref. 5], who
considered the problem where system tests may be incomplete, i.e., a fault-free unit may
detect a faulty unit only p percent of the time (p < 100). This model is represented by
Apt in Table 1.1.

9. Zero-information tester: This model provides no reliable test outcomes. It was
considered by Marion L. Blount [Ref. 6]. Several different fault detection
requirements can be addressed:

a. A fault-free unit can incorrectly diagnose another fault-free unit.

b. A fault-free unit can fail to diagnose a faulty unit.

c. A faulty unit can give a correct diagnosis of another unit (faulty or fault-free).

This model is represented by Ao in Table 1.1.
Model                           FF->FF   FF->F   F->FF   F->F
Perfect tester (Aa)               0        1       0       1
1-fail-safe tester (Aw)           0        1       1       1
Ab                                0        1       x       1
0-fail-safe tester (Ay)           0        1       0       0
Model of item 6                   0        1       0       x
Ax                                0        1       1       x
PMC (Ap)                          0        1       x       x
Partial tester (Apt)              0        x       x       x
Zero-information tester (Ao)      x        x       x       x

FF = fault-free unit, F = faulty unit; "FF->F" denotes a fault-free unit testing a faulty unit.
Entries are test outcomes: 0 = pass, 1 = fail, x = 0 or 1 arbitrarily.

Table 1.1 Different models of system diagnosis

All the models mentioned above are graph theoretic. Analysis of such systems is
typically done by hand calculation, which limits both the number of units and the number
of fault configurations that can be examined. Thus, the theory is difficult to analyze in
practice. There is also much interest in making the model more realistic; this, in fact,
inspired the models described above. For example, Ab was proposed to model tests among
processors that consist of comparing the results of computations. The goal of this thesis
is to further improve the model. Specifically, it addresses the problem of
reconfiguration, which has so far received relatively little study.

B.    PROBLEM  ENVIRONMENT 

The fault diagnosis problem is to determine the faulty processors given the set of test
outcomes. Almost all previous studies have assumed a central diagnoser, which collects
all of the test results and identifies the faulty processors from them. This assumption
simplifies the problem and avoids the complexities of reliable replacement. But a central
diagnoser is also a processor, which might fail, in which case the system diagnosis may
not be accurate. To provide accurate system diagnosis, the central diagnoser would have to
be ultra reliable, which is expensive and requires extra maintenance effort. To overcome
these difficulties, distributed system diagnosis is proposed. In the distributed systems
proposed here, the hardware required to achieve reliability is simple and can be made
ultra reliable inexpensively.
II. BACKGROUND

A.    PREPARATA-METZE-CHIEN  (PMC)  GRAPH  MODEL 

A multiprocessing system is composed of n processors. Each processor is called a unit
(node), where a unit is a well-identifiable portion of the system which cannot be further
decomposed for the purpose of diagnosis. Units are indicated by Ui, 0 ≤ i ≤ n-1. These
units must be powerful enough to test other individual units. A test corresponds to an
arc between processors, with the arrow pointing to the tested unit. Arcs are denoted by
aij, where i is the number of the unit doing the test and j is the number of the unit
being tested. Each test has two outcomes, pass and fail: 0s correspond to pass outcomes
and 1s to fail outcomes. Faulty processors are indicated by X's. Figure 2.1 shows a
five-processor multiprocessor system in which U2 and U3 are faulty. A test is meaningful
only if the testing unit itself is fault-free; otherwise the test outcome is unreliable.


Figure 2.1 Five-processor multiprocessor system with faulty units and test outcomes
Figure 2.2 shows how test results occur in the model we have chosen. The top arc goes
from a fault-free node to a fault-free node; in this case a 0 (pass) outcome is always
produced. The second arc goes from a fault-free node to a faulty node; in this case a 1
(fail) outcome is always produced. The third arc goes from a faulty node to a fault-free
node, and the fourth arc goes from a faulty node to a faulty node. The outcomes of these
last two cases are unpredictable and can be 0 or 1 arbitrarily.
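The test-outcome rule of Figure 2.2 can be expressed as a small function. The sketch below is an illustration under stated assumptions: a faulty tester's arbitrary outcome is modeled by a seeded random choice, the helper names are hypothetical, and the single loop U0 -> U1 -> U2 -> U3 -> U4 -> U0 stands in for the connection of Figure 2.1 (which is not reproduced here), with U2 and U3 faulty.

```python
import random


def pmc_outcome(tester_faulty, tested_faulty, rng):
    """One test under the PMC model: 0 = pass, 1 = fail."""
    if not tester_faulty:
        # A fault-free tester always reports the true status of the tested unit.
        return 1 if tested_faulty else 0
    # A faulty tester is unpredictable; model it as an arbitrary 0 or 1.
    return rng.randint(0, 1)


def generate_syndrome(edges, faulty, rng):
    """Map each test link (i, j) to the outcome of Ui testing Uj."""
    return {(i, j): pmc_outcome(i in faulty, j in faulty, rng) for i, j in edges}


loop = [(i, (i + 1) % 5) for i in range(5)]   # U0->U1->U2->U3->U4->U0
syndrome = generate_syndrome(loop, faulty={2, 3}, rng=random.Random(0))
print(syndrome[(1, 2)], syndrome[(4, 0)])  # 1 0: both testers are fault-free
```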

Definition 1: The set of test outcomes aij represents the syndrome of the system;
obviously, aij can be assigned if and only if the corresponding testing link exists
[Ref. 3: p-848]. In Figure 2.1 the syndrome of the system for one loop is (a01, a12, a23,
a34, a40), where the left-to-right arrangement of the aij reflects the direction of the
loop. Diagnosis is the process of determining the faulty units from a set of test outcomes.
At this point, we need to define distinguishable and indistinguishable fault patterns.

Figure 2.2 Assumed test outcomes in the Preparata-Metze-Chien model
System: loop U0 -> U1 -> U2 -> U3 -> U4 -> U0

      Faulty units    a01   a12   a23   a34   a40
a)    {U0}             x     0     0     0     1
b)    {U1}             1     x     0     0     0
c)    {U0, U1}         x     x     0     0     1

(x: unpredictable outcome, 0 or 1)

Figure 2.3 A system and associated test outcomes

Faults in units Ui and Uj are distinguishable if the syndromes associated with them are
necessarily different, that is, if no syndrome can be produced by both faults. The two
faults are indistinguishable if the same syndrome can be produced by either of them.
These definitions extend directly to sets of faults, called fault patterns. Figure 2.3
depicts a system and its test outcomes for three different cases. If U0 is faulty, the
syndrome shown in line a is produced. If U1 is faulty, the syndrome shown in line b is
produced. These two faults are distinguishable since the value of a40 differs. The
multiple fault pattern {U0, U1} has the syndrome in line c, and since it may be the same as
the syndrome for the fault pattern {U0} (depending on the unpredictable values of a01 and
a12), {U0} and {U0, U1} are indistinguishable.
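For small systems these definitions can be checked mechanically by enumerating every syndrome that a fault pattern allows and testing whether two patterns share one. The sketch below is an illustration under the PMC assumptions of Figure 2.2; the helper names are hypothetical, and the loop of Figure 2.3 is encoded with the syndrome order (a01, a12, a23, a34, a40).

```python
from itertools import product


def possible_syndromes(edges, faulty):
    """All syndromes (tuples ordered like `edges`) allowed by the PMC model
    when exactly the units in `faulty` are faulty."""
    choices = []
    for i, j in edges:
        if i in faulty:
            choices.append((0, 1))               # faulty tester: either outcome
        else:
            choices.append((int(j in faulty),))  # fault-free tester: exact outcome
    return set(product(*choices))


def distinguishable(edges, pattern_a, pattern_b):
    """Two fault patterns are distinguishable iff no syndrome is possible for both."""
    return possible_syndromes(edges, pattern_a).isdisjoint(
        possible_syndromes(edges, pattern_b))


loop = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]   # (a01, a12, a23, a34, a40)
print(distinguishable(loop, {0}, {1}))      # True  (lines a and b)
print(distinguishable(loop, {0}, {0, 1}))   # False (lines a and c)
```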
B.    ONE-STEP  T-FAULT  DIAGNOSABLE  SYSTEMS

Definition  2:  A  system  of  n  units  is  one-step  t-fault  diagnosable  if  all  faulty  units 
within  the  system  can  be  uniquely  identified,  provided  the  number  of  faulty  units  present 
does  not  exceed  t  [Ref.  3]. 

1.  NECESSARY  AND  SUFFICIENT  CONDITIONS: 

In this section we investigate the relationship between n and t (the maximum number
of faulty units) for one-step diagnosable systems.

Theorem 1: If a system with n units is one-step t-fault diagnosable, then n ≥ 2t+1.
Conversely, if n ≥ 2t+1, it is always possible to connect the units so as to form a system
that is one-step t-fault diagnosable [Ref. 3].

Proof: To prove the converse, we construct a maximally connected graph; that is,
we connect every pair of the n units in both directions. One characteristic of such a
graph is that a loop exists through any subset of two or more of the n units. It is easily
verified that if all test outcomes along a loop connecting z units have the value 0, then
the z units in the loop are either all faulty or all fault-free (a loop containing both kinds
of units must contain an arc from a fault-free unit to a faulty one, whose outcome is 1).
In particular, if z ≥ t+1, all units in the loop must be fault-free; otherwise, the
hypothesis on the maximum number of faulty units would be violated. Locating a loop of
t+1 or more fault-free units essentially completes the diagnosis process, since any
identified fault-free unit immediately locates all faulty units through its direct links.
Since the system can have at most t faulty units and n ≥ 2t+1, it contains at least t+1
fault-free units; hence the existence of a loop of t+1 or more fault-free units is guaranteed.
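The constructive half of the proof can be turned into a small search. The sketch below is illustrative only: it assumes a complete test graph, t ≥ 1 and n ≥ 2t+1, it uses brute force over groups of t+1 units (practical only for small n), it models faulty testers with random outcomes, and the function names are hypothetical.

```python
import random
from itertools import combinations


def diagnose_complete(n, t, outcome):
    """Diagnose a fully connected system of n >= 2t+1 units (t >= 1).

    outcome[(i, j)] is the result of Ui testing Uj (0 = pass, 1 = fail).
    Following the proof of Theorem 1: find t+1 units whose loop shows only
    0s; they must all be fault-free, and any one of them identifies the
    faulty units through its direct links."""
    for group in combinations(range(n), t + 1):
        loop = zip(group, group[1:] + group[:1])
        if all(outcome[(i, j)] == 0 for i, j in loop):
            anchor = group[0]
            return {j for j in range(n) if j != anchor and outcome[(anchor, j)] == 1}
    raise ValueError("no all-zero loop of t+1 units; more than t faults?")


def random_syndrome(n, faulty, rng):
    """Complete-graph syndrome; faulty testers answer arbitrarily (random here)."""
    return {(i, j): rng.randint(0, 1) if i in faulty else int(j in faulty)
            for i in range(n) for j in range(n) if i != j}


rng = random.Random(1)
print(sorted(diagnose_complete(5, 2, random_syndrome(5, {1, 4}, rng))))  # [1, 4]
```

The result does not depend on the random outcomes chosen for faulty testers, which is exactly what the proof guarantees when at most t units are faulty.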

For a system with n ≤ 2t units and an arbitrary connection, we show the existence
of two distinct allowable fault patterns that may result in exactly the same syndrome. An
allowable fault pattern in our case is any fault pattern with at most t faulty units. The
cases of odd and even n can be treated separately, but the two cases are analogous.
Consider the case of an even number of nodes, n = 2t0 with t0 ≤ t. We partition the
system into two parts, P1 and P2, each containing t0 units. Suppose all units in P1 are
faulty and all units in P2 are fault-free. Then all links between units within P2 have the
value 0, and all links pointing from units in P2 to units in P1 have the value 1. Since the
units in P1 are faulty, many configurations of values may occur on their links. One
possible configuration is for all links between units in P1 to have the value 0 and all
links pointing from units in P1 to units in P2 to have the value 1. By symmetry, when all
units in P1 are fault-free and all units in P2 are faulty, the same pattern of test results
may occur. Hence it is not always possible for the system to differentiate between the two
allowable fault patterns, and the system is not one-step t-fault diagnosable
[Ref. 3: p-850].
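The symmetric configuration used in this argument can be checked directly. The sketch below is a hypothetical illustration for n = 4 and t0 = t = 2 on a complete test graph; because any other connection uses a subset of these links, restricting the syndrome to fewer links keeps both fault patterns consistent. The helper names are assumptions, not from the thesis.

```python
def consistent(outcome, faulty):
    """True if `outcome` can arise under the PMC model when exactly the units
    in `faulty` are faulty: every fault-free tester must report the truth."""
    return all(result == int(j in faulty)
               for (i, j), result in outcome.items() if i not in faulty)


def symmetric_syndrome(part1, part2):
    """Complete-graph syndrome with 0 inside each part and 1 across parts."""
    units = part1 | part2
    return {(i, j): 0 if (i in part1) == (j in part1) else 1
            for i in units for j in units if i != j}


P1, P2 = {0, 1}, {2, 3}          # n = 4 = 2*t0 with t0 = t = 2
s = symmetric_syndrome(P1, P2)
print(consistent(s, P1), consistent(s, P2))  # True True
```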

2.  OPTIMAL  DESIGNS  FOR  ONE-STEP  t-FAULT  DIAGNOSABILITY: 

For this model it has been shown that the number of units n must be at least 2t+1
for a system to be one-step t-fault diagnosable. We now derive a lower bound on the
number of units that test a particular unit.

Theorem 2: In a one-step t-fault diagnosable system, each unit is tested by at least t
other units [Ref. 3: p-850].

Proof: Suppose, to the contrary, that the system is one-step t-fault diagnosable and
that U1, U2, ..., Uk are all the units in the system which test a certain unit U0, with
k < t. Consider the case in which U1, U2, ..., Uk are all faulty. The outcomes of the tests
performed by these faulty units may, of course, assume arbitrary values. Hence no reliable
test is performed on U0, and the two legitimate fault patterns {U1, U2, ..., Uk} and
{U0, U1, U2, ..., Uk}, neither of which has more than t faults (since k+1 ≤ t), are not
distinguishable. Hence, according to Definition 2, the system is not one-step t-fault
diagnosable. Since a contradiction has been reached, the assertion stated in the theorem
is proved.

Definition 3: A one-step t-fault diagnosable system is said to be optimal if n = 2t+1
and each unit is tested by exactly t units [Ref. 3: p-850].

In general, many optimal designs exist. To describe a family of designs Dt, it is
convenient to designate the n units by U0, U1, ..., Un-1 and to perform all computations
on the subscripts modulo n. We consider a class of designs in which the testing
connection at each unit is identical: whether there is a testing link from Ui to Uj
depends only on the value of l = j-i (modulo n). A test exists if and only if 1 ≤ l ≤ t.
Preparata, Metze and Chien [Ref. 2] showed that the design Dt with n = 2t+1 units is an
optimal one-step t-fault diagnosable system.

C.    SEQUENTIALLY  DIAGNOSABLE  SYSTEMS: 

Definition 4: A system of n units is sequentially t-fault diagnosable if at least one faulty
unit can be identified without replacement, provided the number of faulty units present
does not exceed t [Ref. 3: p-849].

Every system that is one-step t-fault diagnosable is clearly also sequentially t-fault
diagnosable, but a sequentially diagnosable system need not be one-step t-fault
diagnosable. In the previous section we saw that, by Theorem 2, at least nt links are
required for a system of n units to be one-step t-fault diagnosable, and design Dt attains
this bound. The investigation of sequentially diagnosable systems is motivated by the
expectation that fewer test links are required in such systems. Theorem 1 is also valid
for sequentially diagnosable systems; hence, for any sequentially t-fault diagnosable
system, n ≥ 2t+1.

Theorem 3: There exists a class of designs with N = n+2t-2 test links that are
sequentially t-fault diagnosable [Ref. 3: p-852].

Proof: Consider the following design. First, connect all units U0, U1, ..., Un-1 in a
loop such that for every i there is a link from Ui to Ui+1 (all subscripts are taken modulo
n). Second, select a subset S1 of 2t-2 units from the set {U1, U2, U3, ..., Un-2} and
establish a link from each unit of S1 to U0. This is shown in Figure 2.4. U0 is thus tested
by the 2t-1 units of S1 ∪ {Un-1}; let n0 (n1) be the number of these test signals having
the value 0 (1), so that n0 + n1 = 2t-1. The following cases are possible:

Case 1: n1 > t. The assumption that U0 is not faulty implies that the n1 > t units
failing it are faulty, thus violating the hypothesis on the maximum number of faulty
units. Therefore n1 > t implies that U0 is faulty.
Case 2: n1 < t. Then n0 = 2t-1-n1 ≥ t. The assumption that U0 is faulty implies that
the n0 ≥ t units passing it are also faulty, giving at least t+1 faulty units, which again
violates the hypothesis. Therefore n1 < t implies that U0 is not faulty.

Case 3: n1 = t. Consider the set S' = S1 ∪ {Un-1} ∪ {U0}, for a total of 2t units. If U0
is not faulty, the set contains the n1 = t faulty units that fail U0; if U0 is faulty, the
set contains U0 and the n0 = t-1 faulty units that pass it, for a total of t. In both cases
the set contains t faulty units. We conclude that all units of the system not contained
within S' are not faulty, and since n ≥ 2t+1, at least one such fault-free unit can be
identified. Therefore, n1 = t implies the existence and identification of at least one
fault-free unit.

To locate at least one faulty unit we proceed as follows. In case 1, U0 is the faulty
unit. In cases 2 and 3 we have located at least one fault-free unit. To locate a faulty
unit, we simply travel from that unit along the loop of testing links in the direction of
the arrows, following the test signals until we see a 1 for the first time; the unit
tested by this link is faulty [Ref. 3: p-852]. (If no 1 is seen in a full traversal of the
loop, the system contains no faulty unit.) Considering all three cases above, we have
therefore identified at least one faulty unit, which is necessary and sufficient for
sequential diagnosis.


Figure  2.4  An  example  of  sequential  diagnosis  connection  for  n=14  and  t=6 
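The decision procedure in the proof of Theorem 3 can be sketched as follows. This is an illustration, not the thesis's implementation: the choice S1 = {U1, ..., U10} for the n = 14, t = 6 case of Figure 2.4 is a hypothetical one, faulty testers are modeled by random outcomes, and the function names are assumptions.

```python
import random


def sequential_design(n, t, s1):
    """Theorem 3 design: loop Ui -> Ui+1 plus links from the 2t-2 units of S1 to U0."""
    if len(s1) != 2 * t - 2 or not all(1 <= u <= n - 2 for u in s1):
        raise ValueError("S1 must hold 2t-2 units chosen from U1..Un-2")
    return [(i, (i + 1) % n) for i in range(n)] + [(u, 0) for u in s1]


def locate_faulty_unit(n, t, s1, outcome):
    """Return one faulty unit, or None if the syndrome shows no faults."""
    n1 = sum(outcome[(u, 0)] for u in list(s1) + [n - 1])  # 1s on the 2t-1 tests of U0
    if n1 > t:
        return 0                                # Case 1: U0 is faulty
    if n1 < t:
        start = 0                               # Case 2: U0 is fault-free
    else:                                       # Case 3: units outside S' are fault-free
        s_prime = set(s1) | {n - 1, 0}
        start = min(u for u in range(n) if u not in s_prime)
    unit = start
    for _ in range(n):                          # walk the loop from a fault-free unit
        nxt = (unit + 1) % n
        if outcome[(unit, nxt)] == 1:
            return nxt                          # first 1 seen: tested unit is faulty
        unit = nxt
    return None                                 # all 0s around the loop: no faults


n, t = 14, 6                                    # the sizes of Figure 2.4
s1 = list(range(1, 2 * t - 1))                 # hypothetical choice: U1..U10
edges = sequential_design(n, t, s1)
rng = random.Random(3)
faulty = set(rng.sample(range(n), t))
outcome = {(i, j): rng.randint(0, 1) if i in faulty else int(j in faulty)
           for i, j in edges}
print(locate_faulty_unit(n, t, s1, outcome) in faulty)  # True
```

The design uses n + 2t - 2 = 24 links here, compared with nt = 84 links for a one-step design on the same 14 units.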
D.  GENERALIZATION  OF  FAULTS

tp-fault  diagnosability:  A  system  is  tp-diagnosable  if  and  only  if  the  application  of 
the  test  set  identifies  precisely  which  faults  are  present,  provided  the  number  of  faults 
does  not  exceed  tp  [Ref.  9].  (This  is  precisely  one-step  t-fault  diagnosability.) 

Most studies of the self-diagnosability of systems have assumed that only permanent
(solid) faults can be present. Consideration of intermittent faults is generally difficult,
since it requires modeling the behavior of these faults in a system and also requires
interactive testing strategies to detect them. Mallela and Masson [Ref. 10] consider the
effect of intermittent faults in diagnosable systems. The presence of both permanent and
intermittent faults in a system, for example, affects the test outcomes obtained after
repeated application of the test routines. These outcomes may yield an incomplete
diagnosis of faulty units, since not all the faulty units in the system may be detected.

ti-fault diagnosability: A system is ti-fault diagnosable if, in the presence of ti
intermittent faults, no fault-free unit will ever be diagnosed as faulty, and the diagnosis
will at worst be incomplete [Ref. 4].

In general, the fact that a system is tp-fault diagnosable does not necessarily imply
that it is also ti-fault diagnosable. Mallela and Masson also give necessary and sufficient
conditions for one-step ti-fault diagnosability.

t/s-diagnosability: A multiprocessing system is t/s-diagnosable if one can always
identify a set of s or fewer processors which contains all permanently faulty processors,
provided there are no more than t faulty processors. In general t ≤ s, so this definition
relaxes the restriction, made in previous studies, that no fault-free processor may be
replaced [Ref. 7].

E.  SMITH'S  ALGORITHM: 

Consider  three  replacement  algorithms  [Ref.  8]  for  faulty  processors: 

ST1: At each step, perform the tests and replace the processors which fail at least one
test with randomly chosen spares. If all test results are pass, the system is assumed to
be correct.
ST2: At each step, perform the tests and replace the processors which fail the maximum
number of tests. Replaced processors are placed back into the set of spares. If all test
results are pass, the system is assumed to be correct.

ST3: At each step, perform the tests and replace the processors which fail the maximum
number of tests. Put these into SPARE-II and replace them with randomly selected spares
from SPARE-I. If the number of processors in SPARE-I is not sufficient, choose any
additional needed spares randomly from SPARE-II. If all test results are pass, the system
is assumed to be correct. (Initially, all spares are in SPARE-I and SPARE-II is empty.)

ST1 is fast but tends to replace many fault-free processors (those which fail at least
one test performed by a faulty processor). ST2 replaces fewer fault-free processors, but it
is slower. ST3 is the most sophisticated, since it tends to keep SPARE-I enriched in
fault-free processors and resorts to selecting suspected faulty spare processors only when
necessary [Ref. 8].
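The selection step of ST1 and ST2 can be sketched as below. This is a hypothetical illustration: it only decides which processors each strategy would replace and omits spare management, so ST3, which depends on the SPARE-I and SPARE-II pools, is not shown. The syndrome in the example is invented for illustration.

```python
from collections import Counter


def fail_counts(outcome):
    """Number of failed tests per unit; outcome[(i, j)] is Ui's verdict on Uj."""
    return Counter(j for (i, j), result in outcome.items() if result == 1)


def st1_select(outcome):
    """ST1: replace every unit that fails at least one test."""
    return set(fail_counts(outcome))


def st2_select(outcome):
    """ST2: replace only the units that fail the maximum number of tests."""
    counts = fail_counts(outcome)
    if not counts:
        return set()                      # all tests pass: system assumed correct
    most = max(counts.values())
    return {u for u, c in counts.items() if c == most}


# Hypothetical syndrome: U1 is faulty and wrongly fails the fault-free U2.
outcome = {(0, 1): 1, (2, 1): 1, (1, 2): 1, (3, 2): 0, (0, 3): 0, (2, 0): 0}
print(sorted(st1_select(outcome)), sorted(st2_select(outcome)))  # [1, 2] [1]
```

In this example ST1 also replaces the fault-free U2, while ST2 replaces only U1, which matches the trade-off described above.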

d-disabling rule: Processor Ui is disabled (i.e., not allowed to participate in
computation) if and only if Ui fails d or more tests performed by enabled processors
[Ref. 7].
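The d-disabling rule can be simulated directly. The sketch below makes assumptions that the text does not fix: all units update synchronously in rounds, each faulty tester keeps a fixed outcome, and the example systems are hypothetical rather than those of Figure 2.5. The function names are likewise illustrative.

```python
def disabling_step(n, outcome, enabled, d):
    """Apply the d-disabling rule once: a unit is enabled in the next round
    iff it fails fewer than d tests performed by currently enabled units."""
    fails = [0] * n
    for (i, j), result in outcome.items():
        if i in enabled and result == 1:
            fails[j] += 1
    return frozenset(u for u in range(n) if fails[u] < d)


def simulate(n, outcome, d, initial=None, max_rounds=100):
    """Iterate the rule synchronously from `initial` (default: all enabled).

    Returns (history, settled). settled is True when a fixed point is reached
    and False when an earlier state recurs (oscillation) or rounds run out."""
    state = frozenset(range(n) if initial is None else initial)
    history, seen = [state], {state}
    for _ in range(max_rounds):
        nxt = disabling_step(n, outcome, state, d)
        if nxt == state:
            return history, True
        history.append(nxt)
        if nxt in seen:
            return history, False
        seen.add(nxt)
        state = nxt
    return history, False


# Hypothetical loop U0->U1->U2->U3->U4->U0 with U2 faulty; U2 wrongly fails U3.
loop_outcome = {(0, 1): 0, (1, 2): 1, (2, 3): 1, (3, 4): 0, (4, 0): 0}
history, settled = simulate(5, loop_outcome, d=1)
print(settled, sorted(history[-1]))  # True [0, 1, 3, 4]

# Two faulty units that fail each other, updated synchronously.
history, settled = simulate(2, {(0, 1): 1, (1, 0): 1}, d=1)
print(settled, [sorted(s) for s in history])  # False [[0, 1], [], [0, 1]]
```

The second example shows faulty processors becoming alternately enabled and disabled; under asynchronous updates the behavior may differ.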
Figure 2.5 Five-processor multiprocessor system for two arrangements of faulty
processors
Consider the 1-disabling rule in Figure 2.5(a) and assume U2 and U3 are faulty and
enabled. Then U4 is disabled even though it is fault-free. U0 is also fault-free and
disabled. However, since U1 fails no test, it will become permanently enabled. It
follows that U2 and U3 will eventually be disabled. Thus the fault-free nodes U4 and U0,
which were originally disabled, will become permanently enabled. Now consider the system
in Figure 2.5(b), where there are also two faulty units, and assume the 1-disabling rule
applies as before. If U2 and U4 are enabled before any of the other processors are
enabled, the fail test outcomes they produce disable U0, U1 and U3. Since all fault-free
processors are disabled and the tests among faulty processors are pass, both faulty
processors remain enabled. Unlike the case just discussed, the system will never correct
itself: a permanent situation exists in which all faulty processors are enabled and all
fault-free processors are disabled. In the same figure, if we apply the 2-disabling rule
with the same initial conditions (i.e., U2 and U4 faulty and enabled), the fault-free
processors will eventually become disabled, while only one of the faulty processors will
be disabled. Thus, both the 1- and 2-disabling rules lead to an unsatisfactory diagnosis.
III. PROBLEM

A.    SIMPLE  DIAGNOSABILITY  TESTS  FOR  MULTIPROCESSING  SYSTEMS 

Recall that we are interested in distributed fault diagnosis of the system, since with it
ultra reliability can be achieved less expensively. The basic idea behind distributed
self-diagnosis is that the diagnosis algorithm is executed on the remaining intact units
of the system. In contrast to central diagnosis, which assumes an external (perfect) unit
for computing the diagnosis results, distributed diagnosis is performed throughout the
system. First, each node is diagnosed by its immediate neighboring nodes. In a second
step, these local diagnosis results are used to disable processors.

To achieve distributed fault diagnosis in a system, each unit is equipped with disabling
circuitry, so that testing processors can determine the status (enabled or disabled) of
the tested processor. Determining how many faulty processors can be tolerated before it
becomes impossible to identify them correctly is a very difficult task in general
multiprocessing systems. For example, as shown in Chapter II, Figure 2.3, two different
fault patterns can produce the same test outcomes (syndrome).

The problem of locating faulty processors within a multiprocessor system by temporarily
halting normal operation and placing the system in a diagnostic mode has been studied
using the PMC model. When the number of modules in the system is large, some of them will
be idle at any given moment. A test may be any sort of check by one processor on the
operation of another, including applying test vectors and checking the resulting outputs.
In a concept called "roving diagnosis", introduced by Nair, Metze and Abraham [Ref. 9],
one part of the system diagnoses a second part while the remainder of the system continues
normal operation. The part most recently diagnosed as fault-free then takes its turn in
diagnosing other parts. Thus, a subsystem of diagnosing and diagnosed units appears to
"rove" through the system until no part of it remains undiagnosed. However, roving
diagnosis must ensure that the first diagnosis produces unique, identifiable results. The
checks are performed at the system level on data elements that constitute the results of
computations on these systems. It is assumed [Ref. 10: p-298] that each processor has a
local memory on which it performs reads and writes. In addition, it can communicate with
other processors in the system through the buffers at various input and output ports. A
processor cannot read from or write to any other processor's local memory, even in the
presence of a fault. A fault is any condition that causes a malfunction in a single
processor while it performs operations.

B.  RECONFIGURATION 

Definition 5: A system is c-correctable using the d-disabling rule if and only if,
provided there are c or fewer faulty nodes [Ref. 7]:

1. All faulty nodes are eventually permanently disabled.

2. All fault-free processors are eventually permanently enabled.
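For a very small system, Definition 5 can be checked exhaustively. The sketch below rests on assumptions that Ref. 7 may not share: units update synchronously in rounds, each faulty tester gives a fixed outcome (every possible choice is enumerated), and every initial enabled/disabled state is tried. A run that ends in a cycle longer than one round counts as a failure. The function names and the five-unit loop are hypothetical.

```python
from itertools import combinations, product


def disabling_step(n, outcome, enabled, d):
    """One synchronous round of the d-disabling rule."""
    fails = [0] * n
    for (i, j), result in outcome.items():
        if i in enabled and result == 1:
            fails[j] += 1
    return frozenset(u for u in range(n) if fails[u] < d)


def settles_correctly(n, outcome, d, faulty, start):
    """Run the rule from `start` until a fixed point or a cycle is reached and
    report whether it ends with exactly the fault-free units enabled."""
    target = frozenset(range(n)) - faulty
    state, seen = frozenset(start), set()
    while state not in seen:
        seen.add(state)
        nxt = disabling_step(n, outcome, state, d)
        if nxt == state:
            return state == target
        state = nxt
    return False                     # cycle of length > 1: never settles


def is_c_correctable(n, edges, d, c):
    """Exhaustive check of Definition 5 under the assumptions stated above."""
    for k in range(c + 1):
        for faulty in map(frozenset, combinations(range(n), k)):
            faulty_links = [e for e in edges if e[0] in faulty]
            for answers in product((0, 1), repeat=len(faulty_links)):
                outcome = {(i, j): int(j in faulty) for i, j in edges if i not in faulty}
                outcome.update(zip(faulty_links, answers))
                for bits in product((0, 1), repeat=n):
                    start = [u for u in range(n) if bits[u]]
                    if not settles_correctly(n, outcome, d, faulty, start):
                        return False
    return True


loop = [(i, (i + 1) % 5) for i in range(5)]
print(is_c_correctable(5, loop, d=1, c=1))  # True
print(is_c_correctable(5, loop, d=1, c=2))  # False
```

With two adjacent faulty units, the second one is tested only by a faulty unit that may always pass it, so it can stay enabled forever; this is why the loop fails for c = 2.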

The main goal in system reconfiguration is to switch in all fault-free units and to
switch out all faulty units. This switching is not between two working systems, but
between the working system and the spares. The goal is not only to switch out the faulty
units but also to keep the working system functional. This gives the system more
flexibility but increases its cost. The problem is to derive a distributed strategy for
correct switching which is insensitive to the arrangement of faulty processors. Since it
may sometimes be difficult to replace a specific processor, a rearrangement of the applied
tests can give more accurate results. A flexible test arrangement allows an approach which
views the diagnostic task as one of arranging processors into two groups, a working group
and a spare group. Another approach is to have three groups: one for critical operations,
one for noncritical operations, and one for spares. In this thesis, however, we consider
only the first approach.

C.  RELATIONSHIP  BETWEEN  ENABLED/DISABLED  UNITS  AND  SYSTEM 
RELIABILITY 

In an implementation of distributed diagnosis, two major problems must be considered in
order to obtain correct diagnostics:

1.  Reliable  implementation  of  the  disabling  criteria  and  function. 
2. Reliable transmission of the test signals (pass, fail) and of the resulting disabling
signals (enabled or disabled) for system units.

It should be noted that distributed diagnosis uses only local information to identify faulty
processors, whereas central diagnosis uses all test results. We would therefore expect
distributed diagnosis to be less accurate. This shows up as a smaller number of faulty
nodes that can be tolerated in distributed diagnosis.
IV. METHOD OF APPROACH

A.  WHY  A  CAD-TOOL? 

Our approach to the problem of developing diagnosis strategies is to develop a CAD
(Computer Aided Design) tool for simulating different fault patterns and different
reconfiguration strategies. Previous studies used hand calculations for this purpose. Once
the system has more than seven units, hand calculations become complex, so only a limited
number of units and fault patterns can be examined by hand. With the CAD-tool, the user
can simulate from 2 to 20 units with various fault patterns. The limit of 20 units is
imposed by the monitor screen.

Thus, the tool lets the user simulate a large number of units and fault patterns. The
number of units in a network is known in advance and can be predefined in the tool
program. The user determines the number and names of the faulty nodes. Testing connections
can be predefined by the user or by the program. The user chooses the test procedure
(worst_case or user_defined_case) and also defines the disabling criteria. After this input,
the CAD-tool determines the test results and the enabled and disabled units, and then
displays the system on a control unit monitor. The CAD-tool therefore simulates the
behavior of a computer network without any hand calculation.

B.  TOOL  DEFINITIONS: 

This CAD-tool is written in the C programming language [Ref. 12] using the PMC graph
model. The terms used in the program are listed below with short explanations:

N = The number of units in the system (may change from 1 to 20).

f = The number of faulty nodes (0 < f < N-1).

T = The number of units that test one unit. This number is the same for all units. As in
the program listing of Appendix A, unit k is tested by units k-1, k-2, ..., k-T (modulo N).
The program determines the test results on these connections according to the user's
choice of a worst case or an arbitrary case.

In the worst case, the program itself determines all test results. Fault-free testing units
always report the true status of the tested unit. Faulty testing units produce a fail (1)
outcome for fault-free tested units and a pass (0) outcome for faulty tested units. This
information is the exact opposite of the true status of the units, which is why it is called
the worst case. In the user-defined (arbitrary) case, the user defines the outcomes produced
by faulty testing units (for both faulty and fault-free tested units).

d = The disabling criterion, defined by the user. If a tested unit receives at least d fail
test outcomes from enabled units, it is disabled; otherwise it is enabled. All units are
updated at the same time, from the enabled/disabled state of the previous iteration.
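Two small C functions make the definitions above concrete. This is a sketch under stated assumptions, not the thesis code. The names `worst_case_tests` and `apply_disabling_rule` and the 0/1 flag arrays are hypothetical. The test connection (unit k tested by units k-1, ..., k-T, modulo N) and the column layout `test[k][j]`, j = 1..T, follow the Appendix A listing.

```c
#include <assert.h>

#define MAX_UNITS 20   /* the tool's limit on N */

/* Fill test[k][j] (j = 1..t) with worst-case outcomes, 1 = fail, 0 = pass.
   Unit k is tested in column j by unit (k - j) mod n.  A fault-free tester
   reports the true status; a faulty tester reports the opposite. */
void worst_case_tests(int n, int t, const int faulty[],
                      int test[MAX_UNITS][MAX_UNITS])
{
    assert(n > 0 && n <= MAX_UNITS && t >= 1 && t < n);
    for (int k = 0; k < n; ++k) {
        for (int j = 1; j <= t; ++j) {
            int tester = ((k - j) % n + n) % n;   /* wrap around the ring */
            int truth = faulty[k] ? 1 : 0;
            test[k][j] = faulty[tester] ? !truth : truth;
        }
    }
}

/* One synchronous application of the d-disabling rule.  Unit k is disabled
   in next[] when at least d of its currently enabled testers report fail,
   and enabled otherwise.  enabled[] is only read, so every unit is judged
   from the same current state.  1 = enabled, 0 = disabled. */
void apply_disabling_rule(int n, int t, int d,
                          int test[MAX_UNITS][MAX_UNITS],
                          const int enabled[], int next[])
{
    assert(n > 0 && n <= MAX_UNITS && t >= 1 && t < n && d >= 1);
    for (int k = 0; k < n; ++k) {
        int fails = 0;
        for (int j = 1; j <= t; ++j) {
            int tester = ((k - j) % n + n) % n;
            if (enabled[tester] && test[k][j] == 1)
                ++fails;
        }
        next[k] = (fails >= d) ? 0 : 1;
    }
}
```

Take the five-unit system of Case 1: T = 2, U2 and U3 faulty, and only U2 and U3 enabled. One call of `apply_disabling_rule` with d = 1 disables U0 and U4 and enables U1, U2 and U3. This matches the first iteration reported in Chapter V.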

C.    TOOL  SPECIFICATION 

Figure  4.1  shows  the  flowchart  of  the  main  body  of  the  system  tool.  As  can  be  seen, 
the  user  can  specify  initial  conditions  and  then  allow  the  system  to  execute  diagnostic 
steps  one  after  the  other. 

Figure 4.2 shows a more detailed flowchart of the program. First, the user defines the
number of units in the system. If this number is less than 1 or greater than 20, the
program produces an error message. The user then defines the number and the names of the
faulty nodes. Next, the user defines T (the number of units testing one unit) and the test
procedure (worst case or arbitrary case). The program determines the test results and
displays them on the screen. The user then defines the disabling criteria and the number and
names of the initially enabled units (all other units start out disabled). The tool displays
the whole system in its initial condition by calling the subroutine drawing.
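The range check on N described above can also be written to survive non-numeric input. Below is a minimal sketch. `read_int_in_range` is a hypothetical helper. It reads whole lines with `fgets` and converts them with `strtol`, replacing the bare `scanf("%d")` calls of Appendix A. Those calls can loop forever when the user types a letter, because the rejected token stays in the input buffer.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Read integers from `in`, one per line, until one lies in [lo, hi].
   Each rejected entry produces an error message on `out`.  Returns the
   accepted value, or lo - 1 if the input ends first. */
int read_int_in_range(FILE *in, FILE *out, int lo, int hi)
{
    char line[64];
    assert(lo <= hi);
    while (fgets(line, sizeof line, in) != NULL) {
        char *end;
        long v = strtol(line, &end, 10);
        if (end != line && v >= lo && v <= hi)   /* a number, in range */
            return (int)v;
        fprintf(out, "THE NUMBER OF UNITS IS NOT VALID, TRY AGAIN\n");
    }
    return lo - 1;
}
```

With `lo = 1` and `hi = 20`, this mirrors the check on N. Passing the streams as parameters is a design choice: it lets the same routine read from `stdin` in the tool or from a file in a test.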
Figure 4.1 Flowchart of CAD-tool
Figure 4.2 Detailed flowchart of CAD-tool
To see the disabling rule applied, the user selects option #5 from the menu shown in
Table 4.1. The program then determines the enabled and disabled units and displays the
first iteration by calling the drawing subroutine. The user can continue with further
iterations under the same conditions. After some number of iterations, the user can exit the
program or go back to the beginning and simulate another system under different
conditions.

1.  INTRODUCTION 

2. SYSTEM SET_UP

3.  SET  TEST  RESULTS 

4.  SET  THE  DISABLING  CRITERIA 

5.  APPLY  DISABLING  RULE 

6.  EXIT 

Table  4.1  Menu  of  CAD-tool 

D.    TOOL  REALIZATION 

The CAD tool is made up of five main parts (subroutines). The first, menu option #1,
gives a brief explanation of the program. Option #2 sets up the type of system, the number
and names of the units, and the number and names of the faulty units. Option #3 sets up T
and the test procedure. Option #4 sets up the disabling criteria and the number and names of
the enabled units. It then displays the system's initial condition by calling the subroutine
drawing. Option #5 applies the disabling criteria, determines the enabled and disabled
units, and then displays the system. In the drawing subroutine, enabled fault-free units are
green circles and enabled faulty units are green circles with an X inside. Disabled
fault-free units are red circles, and disabled faulty units are red circles with an X inside.
The color of the testing arrows represents the test results: a green arrow means a pass (0)
outcome and a red arrow means a fail (1). The menu reappears on the screen after each
option finishes, so a user who makes a mistake can easily correct it by choosing the same
option again. The main program is very straightforward: it simply calls the subroutines
according to the selected menu options.
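The main program described above is a chain of `if` tests on the menu choice. A table of function pointers can express the same dispatch. The sketch below is an illustration only. The stub functions are hypothetical stand-ins for the tool's subroutines, with `set_tests` playing the role of `test`, and `dispatch` is an assumed name.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-ins for the tool's subroutines; each only records
   which option ran.  In the real tool they prompt the user and draw. */
static int last_option = 0;
static void introduction(void) { last_option = 1; }
static void sys_set_up(void)   { last_option = 2; }
static void set_tests(void)    { last_option = 3; }
static void disable(void)      { last_option = 4; }
static void apply(void)        { last_option = 5; }

/* Options 1..5 of Table 4.1, indexed by option number (slot 0 unused). */
static void (*const menu_actions[])(void) = {
    NULL, introduction, sys_set_up, set_tests, disable, apply
};

/* Carry out one menu choice.  Returns 0 for option 6 (EXIT) and 1 when the
   menu should be shown again.  Invalid numbers are reported and ignored. */
int dispatch(int option)
{
    if (option == 6)
        return 0;
    if (option >= 1 && option <= 5) {
        assert(menu_actions[option] != NULL);
        menu_actions[option]();
    } else {
        printf("INVALID OPTION %d, TRY AGAIN\n", option);
    }
    return 1;
}
```

The main loop then reduces to printing the menu, reading a number and calling `dispatch` until it returns 0. Unlike the original chain of `if` tests, which silently ignores an out-of-range choice, this version prints a message before the menu is shown again.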
V. RESULTS

Figure 5.1 shows a photograph of the CAD-tool menu. Figures 5.2 through 5.5 show the
initial condition and three iterations of a five unit multiprocessor system. In this system,
U2 and U3 are faulty and enabled initially, and are shown in green; the other units are
disabled and shown in red. The disabling criterion is 1 and the test results are the worst
case. After the first iteration, units U0 and U4 are disabled (red) and all the other units
are enabled (green). After the second iteration, U1 is enabled and all the other units are
disabled. After the third iteration, all faulty units (U2, U3) are disabled and all fault-free
units are enabled. In this case, the 1-disabling rule gives the desired result. This example
is explained in Appendix B as Case 1.
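The three iterations of this example can be replayed without the graphics. The sketch below is a re-implementation under assumptions, not the thesis code. States are stored as bit masks, where bit u set means unit u is enabled. The value T = 2 is taken from the Case 1 test matrix in Appendix B, where each unit has two testers. The name `simulate_worst_case` is hypothetical.

```c
#include <assert.h>

#define MAX_UNITS 20

/* Worst-case outcome (1 = fail): a fault-free tester reports the true
   status of the tested unit, a faulty tester reports the opposite. */
static int worst_case(int tester_faulty, int tested_faulty)
{
    return tester_faulty ? !tested_faulty : tested_faulty;
}

/* Run `steps` iterations of the d-disabling rule with worst-case outcomes.
   Unit k is tested by units k-1, ..., k-t (mod n).  trace[0] holds the
   initial condition and trace[s] the state after iteration s. */
void simulate_worst_case(int n, int t, int d, unsigned faulty_mask,
                         unsigned initial, int steps, unsigned trace[])
{
    assert(n > 0 && n <= MAX_UNITS && t >= 1 && t < n && d >= 1);
    trace[0] = initial;
    for (int s = 1; s <= steps; ++s) {
        unsigned next = 0;
        for (int k = 0; k < n; ++k) {
            int fails = 0;
            for (int j = 1; j <= t; ++j) {
                int tester = ((k - j) % n + n) % n;
                int tester_enabled = (trace[s - 1] >> tester) & 1u;
                if (tester_enabled &&
                    worst_case((faulty_mask >> tester) & 1u,
                               (faulty_mask >> k) & 1u))
                    ++fails;
            }
            if (fails < d)
                next |= 1u << k;   /* fewer than d fails: enabled */
        }
        trace[s] = next;
    }
}
```

Call it with n = 5, t = 2, d = 1, faulty mask 0x0C (U2, U3) and initial mask 0x0C. It produces 0x0E (U1, U2, U3 enabled), then 0x02 (U1 only), then 0x13 (U0, U1, U4). After that 0x13 repeats, which is the correct diagnosis.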


Figure  5.1  CAD-tool  menu  and  test  outcomes 


Figure  5.2  Initial  condition 


Figure  5.3  First  iteration 
Figure 5.4 Second iteration


Figure  5.5  Third  iteration 




Figures 5.6 through 5.9 show another five unit multiprocessor system. In this example, U1
and U4 are faulty and enabled initially. The disabling criterion is 2 and the test results
are again the worst case. After the first iteration all units are enabled. After the second
iteration only U4 is disabled and all the other units are enabled. Figures 5.8 and 5.9 are
identical. This means that the system stays in that state, with the faulty unit U1 still
enabled, and cannot correct itself. This example is explained in Appendix B as Case 3.
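The stuck state has a simple explanation that can be checked before any simulation. Under the 2-disabling rule, a unit can be disabled only if at least two of its incoming outcomes are fail. Assume T = 2, each unit being tested by its two predecessors as in Case 1. With worst-case results, U1 then receives a fail from U0 but a pass from the faulty tester U4, so no choice of enabled units can ever disable it. The helper below is a hypothetical sketch of that check; the name `never_disabled` is an assumption. A faulty unit that it flags is sufficient, but not necessary, evidence that the system is not correctable.

```c
#include <assert.h>

#define MAX_UNITS 20

/* Flag the units that can never be disabled under the d-disabling rule
   because fewer than d of their incoming outcomes are fail (1), whatever
   set of testers is enabled.  test[k][j], j = 1..t, uses the column layout
   of Appendix A.  Sets stuck[u] = 1 for each such unit; returns their count. */
int never_disabled(int n, int t, int d, int test[MAX_UNITS][MAX_UNITS],
                   int stuck[])
{
    assert(n > 0 && n <= MAX_UNITS && t >= 1 && t < n && d >= 1);
    int total = 0;
    for (int k = 0; k < n; ++k) {
        int fails = 0;
        for (int j = 1; j <= t; ++j)
            fails += (test[k][j] == 1);
        stuck[k] = (fails < d);
        total += stuck[k];
    }
    return total;
}
```

For the worst-case matrix of this example with d = 2, the helper flags U0, U1, U2 and U3 and returns 4. The flag on the faulty unit U1 alone shows that correct diagnosis can never be reached.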


Figure  5.6  Initial  condition 
Figure 5.7 First iteration


Figure  5.8  Second  iteration 
Figure 5.9 Third iteration

Figures 5.10 through 5.14 show a seven unit multiprocessor system. In this system, U1, U3
and U5 are faulty units and enabled initially. Test results are again the worst case and the
disabling criterion is 2. After the first iteration, U4 and U6 are disabled and all the other
units are enabled. After the second iteration, U3, U4 and U6 are disabled and the other units
are enabled. After the third iteration only U3 is disabled. After the fourth iteration all
faulty units are disabled and all fault-free units are enabled. This indicates that the
2-disabling rule works here and the system corrects itself. This example is explained in
Appendix B as Case 6.
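Why does the final state of this example persist? In the correct state only fault-free units are enabled, and fault-free testers report the true status. A fault-free unit therefore receives no fail outcomes and stays enabled. A faulty unit receives exactly one fail per fault-free tester. So the correct state is a fixed point of the d-disabling rule exactly when every faulty unit has at least d fault-free testers. The sketch below checks this condition. The function name is hypothetical, and T = 3 for this seven-unit system is an assumption read from the Appendix B matrix. Stability alone does not guarantee that the correct state is ever reached.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_UNITS 20

/* True when the correct state (only fault-free units enabled) is a fixed
   point of the d-disabling rule.  Unit k is tested by units k-1, ..., k-t
   (mod n).  Only enabled, fault-free testers count, and they report faulty
   units as fail, so each faulty unit needs at least d fault-free testers. */
bool correct_state_is_stable(int n, int t, int d, const int faulty[])
{
    assert(n > 0 && n <= MAX_UNITS && t >= 1 && t < n && d >= 1);
    for (int k = 0; k < n; ++k) {
        if (!faulty[k])
            continue;               /* fault-free units receive no fails */
        int good_testers = 0;
        for (int j = 1; j <= t; ++j)
            good_testers += !faulty[((k - j) % n + n) % n];
        if (good_testers < d)
            return false;
    }
    return true;
}
```

For this example (U1, U3, U5 faulty, T = 3, d = 2), each faulty unit has two fault-free testers, so the correct state is stable. For the five-unit system of Case 3 (T = 2, d = 2), U1 has only one fault-free tester (U0). Even if that system reached the correct state, it would not stay there.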
Figure 5.10 Initial condition


Figure  5.11  First  iteration 
Figure 5.12 Second iteration


Figure 5.13 Third iteration
Figure 5.14 Fourth iteration

Figures 5.15 through 5.20 show a six unit system. In this system U1, U3 and U5 are faulty
units and the disabling criterion is 2. Test results are arbitrary (user defined) and are
defined as follows: faulty testing units produce a fail (1) outcome for faulty tested units
and a pass (0) outcome for fault-free tested units. In this example, faulty units are
alternately disabled and enabled, so the system never corrects itself. It displays an
oscillation of period six. This example is explained in Appendix B as Case
19.
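Whether a run settles or oscillates can be decided mechanically. The rule is deterministic, and a system of N units has at most 2^N enabled/disabled states. Every run therefore eventually revisits a state and from then on repeats one cycle. The sketch below iterates until a state repeats and reports the length of the transient and of the cycle (1 for a fixed point). Some assumptions apply. The function names and the MAX_STEPS bound are hypothetical. The user-defined outcomes of Case 19 are not fully listed here, so the code uses worst-case outcomes instead. It is checked against Cases 6 and 3 and against a two-unit system constructed only for illustration (U0 faulty, T = 1, d = 1).

```c
#include <assert.h>

#define MAX_UNITS 20
#define MAX_STEPS 1024   /* assumed search bound; a run may need up to 2^n */

/* Worst-case outcome (1 = fail) for tester -> tested, from bit mask `faulty`. */
static int outcome(unsigned faulty, int tester, int tested)
{
    int tester_faulty = (faulty >> tester) & 1u;
    int tested_faulty = (faulty >> tested) & 1u;
    return tester_faulty ? !tested_faulty : tested_faulty;
}

/* One synchronous d-disabling step; bit u of a state = unit u enabled. */
static unsigned step(int n, int t, int d, unsigned faulty, unsigned state)
{
    unsigned next = 0;
    for (int k = 0; k < n; ++k) {
        int fails = 0;
        for (int j = 1; j <= t; ++j) {
            int tester = ((k - j) % n + n) % n;
            if (((state >> tester) & 1u) && outcome(faulty, tester, k))
                ++fails;
        }
        if (fails < d)
            next |= 1u << k;
    }
    return next;
}

/* Iterate from `initial` until some state repeats.  Stores in *transient the
   number of iterations before the cycle is entered and returns the cycle
   length (1 = settles on one state, >1 = oscillation), or 0 if no state
   repeats within MAX_STEPS iterations. */
int find_cycle(int n, int t, int d, unsigned faulty, unsigned initial,
               int *transient)
{
    unsigned history[MAX_STEPS + 1];
    assert(n > 0 && n <= MAX_UNITS && t >= 1 && t < n && d >= 1);
    history[0] = initial;
    for (int s = 1; s <= MAX_STEPS; ++s) {
        history[s] = step(n, t, d, faulty, history[s - 1]);
        for (int p = 0; p < s; ++p) {
            if (history[p] == history[s]) {
                *transient = p;
                return s - p;
            }
        }
    }
    return 0;
}
```

For Case 6, the sketch reports a transient of four iterations followed by a fixed point (the correct diagnosis). For Case 3, it reports a fixed point reached after two iterations. The constructed two-unit system alternates between all units enabled and all units disabled, a cycle of length 2. Case 19 shows the same kind of behavior with a longer period.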
Figure 5.15 Initial condition


Figure 5.16 First iteration
Figure 5.17 Second iteration


Figure  5.18  Third  iteration 
Figure 5.19 Fourth iteration


Figure  5.20    Fifth  iteration 
VI. CONCLUSIONS AND RECOMMENDATIONS

A.  CONCLUSION 

This thesis introduces distributed diagnosis. The analysis of distributed diagnosis is
difficult without a CAD tool. In this research, a CAD-tool has been developed based upon the
PMC graph model. Using this tool, the user can simulate a variety of configurations and
fault patterns. The tool provides a step by step procedure for the user to follow. The user
provides the information related to the faulty nodes (their number and names) and can then
simulate the system for as many iterations as desired.

The CAD-tool counts, for each unit, the fail test outcomes it receives from enabled
processors and compares this count with the disabling criterion. If the number of fail
outcomes reaches or exceeds the criterion, the unit is disabled. The central diagnosis
algorithm eventually settles on a final arrangement of processors. The algorithm described
here, by contrast, can exhibit dynamic behavior such as oscillation.

B.  RECOMMENDATIONS 

It is expected that this tool will be used to study optimum disabling criteria for various
systems. For example, we hope that it will free the user from the tedium of generating
examples, allowing him or her to concentrate on proving properties of the system. One
possibility is to use it in a knowledge-based system, which would then be used to prove
properties of the disabling criteria.
APPENDIX A


SOURCE  CODE 

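/*  This menu helps the user to determine the main selections of the program.
    When running the program for the very FIRST TIME, the user should choose
    option #2 first. Choosing INTRODUCTION (#1) is exempt from this restriction. */

#include <stdio.h>

/* fault_array:   'G' = good (fault-free), 'B' = bad (faulty)
   disable_array: 'E' = enabled, 'D' = disabled (current state)
   dis_res_array: state computed by the next iteration            */
char fault_array[20], disable_array[20], dis_res_array[20];
char U;
int test_array[20][20];     /* test_array[k][j]: outcome for unit k from its j-th tester */
int N, fmax, f, T, k, j, i, no_units_set, p, w, l, dis_crit, count;
int no_en_set;
int response;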

/* Print the main menu (Table 4.1) */
void menu(void)
{
    printf("   \n");
    printf("                M E N U               \n");
    printf("\n");
    printf("        1. INTRODUCTION               \n");
    printf("\n");
    printf("        2. SYSTEM SET_UP              \n");
    printf("\n");
    printf("        3. SET THE TEST RESULTS       \n");
    printf("\n");
    printf("        4. SET THE DISABL. CRITERIA   \n");
    printf("\n");
    printf("        5. APPLY DIS. CRITERIA        \n");
    printf("\n");
    printf("        6. EXIT                       \n");
    printf("  \n\n");
    printf("ENTER THE OPTION NUMBER FROM THE MENU \n\n");
}


/* Print the title box and a short description of the program */
void introduction(void)
{
    printf("*************************************************\n");
    printf("*                                               *\n");
    printf("*  THESIS TOPIC: FAULT TOLERANT COMPUTING       *\n");
    printf("*  IN DISTRIBUTED COMPUTER NETWORKS.            *\n");
    printf("*                                               *\n");
    printf("*  Author: Ibrahim DINCER                       *\n");
    printf("*  Thesis Advisor: Prof. Jon T. BUTLER          *\n");
    printf("*  NAVAL POSTGRADUATE SCHOOL                    *\n");
    printf("*  ELECTRICAL AND COMPUTER ENGINEERING          *\n");
    printf("*  EXTENSION :3299                              *\n");
    printf("*  DATE : APRIL 23,1987                         *\n");
    printf("*                                               *\n");
    printf("*************************************************\n");
    printf("\n");
    printf("  This program is for simulation of a distributed    \n");
    printf("  diagnosis algorithm in a computer network. For this\n");
    printf("  purpose the PREPARATA_METZE_CHIEN model is used. The\n");
    printf("  number of nodes in the system is restricted TO NO  \n");
    printf("  MORE THAN 20. The user enters the number of nodes, \n");
    printf("  the faulty nodes in the network, the test procedure\n");
    printf("  and the disabling criteria. The program displays the\n");
    printf("  network and the test outcomes, and shows enabled   \n");
    printf("  fault_free nodes and disabled faulty nodes.        \n");
    printf("\n");
    printf("  N=    NUMBER OF NODES IN THE SYSTEM                \n");
    printf("  F=    NUMBER OF FAULTY NODES IN THE SYSTEM         \n");
    printf("  FMAX= NUMBER OF ALLOWED FAULTY NODES IN THE SYSTEM \n");
    printf("  T=    NUMBER OF UNITS WHICH ARE TESTING ONE UNIT   \n");
    printf("  D=    DISABLING CRITERIA FOR FAULTY NODES          \n");
    printf("\n");
}


/*  THIS SUBROUTINE DEFINES THE NAMES OF NODES AND ALSO DEFINES THE FAULTY
    NODES IN THE SYSTEM  */
void units(void)
{
    printf(" THE UNITS OF THE SYSTEM ARE\n\n");
    for (i = 0; i < N; ++i)
        printf("%c%d ", 'U', i);
    printf("\n");

    printf(" ENTER THE NUMBER OF FAULTY NODES \n");
    scanf("%d", &f);
    /* THIS LOOP KEEPS THE USER IN THE ALLOWED LIMITS FOR FAULTY UNITS */
    while (f <= 0 || f >= N) {
        printf(" 'F' SHOULD BE GREATER THAN ZERO ");
        printf(" AND LESS THAN N \n");
        scanf("%d", &f);
    }
    printf("THERE ARE %d FAULTY NODES \n\n", f);

    j = 1;
    /* INITIALIZE THE ARRAY THAT DEFINES THE FAULTY NODES */
    for (i = 0; i <= N - 1; ++i)
        fault_array[i] = 'G';           /* INITIALLY ALL NODES ARE GOOD */
    no_units_set = 1;

    printf("ENTER THE FAULTY UNIT NUMBER ONE AT A TIME \n");
    while (no_units_set <= f) {         /* REPEAT UNTIL 'F' UNITS ENTERED */
        scanf("%d", &i);
        while (i > (N - 1) || i < 0) {
            printf(" UNIT NUMBER IS NOT VALID, TRY AGAIN !\n");
            scanf("%d", &i);
        }
        if (fault_array[i] == 'B') {
            printf("THIS UNIT IS PREVIOUSLY DEFINED AS ");
            printf(" FAULTY, TRY AGAIN !\n\n");
        }
        else {
            fault_array[i] = 'B';
            printf(" FAULTY UNIT # %d IS U%d \n\n", j, i);
            ++j;
            ++no_units_set;
        }
    }
}


/*  THIS SUBROUTINE SETS UP THE SYSTEM TO BE TESTED  */
void sys_set_up(void)
{
    printf(" TO DETERMINE THE NETWORK ENTER ONE OF THE ");
    printf(" OPTIONS BELOW\n");
    printf("\n");
    printf("        1. DESIGN   \n\n");
    printf("        2. ARBITRARY SYSTEM  \n\n");
    scanf("%d", &p);
    printf("p=%d\n\n", p);

    if (p == 1) {
        printf("ENTER THE NUMBER OF NODES IN THE SYSTEM\n\n");
        scanf("%d", &N);
        while (N > 20 || N <= 0) {
            printf("THE NUMBER OF UNITS IS NOT VALID,");
            printf(" TRY AGAIN \n");
            scanf("%d", &N);
            printf("N=%d\n\n", N);
        }
        units();
    }
    else {
        printf(" THIS SYSTEM WILL BE DEFINED LATER \n\n");
    }
}


/*  THIS SUBROUTINE DETERMINES THE TEST RESULTS FOR THE SYSTEM. IN THE
    'WORST_CASE' THE PROGRAM DETERMINES ALL THE TEST RESULTS; FOR THE ARBITRARY
    CASE, TEST RESULTS FOR UNITS TESTED BY 'FAULTY' TESTING UNITS ARE DEFINED
    BY THE USER. UNIT k IS TESTED IN COLUMN j BY UNIT (k - j) MOD N;
    1 = FAIL, 0 = PASS.  */
void test(void)
{
    printf(" 'T' IS THE NUMBER OF UNITS TESTING ONE NODE; ENTER");
    printf(" 'T' \n");
    scanf("%d", &T);
    while (T < 1 || T > N - 1) {        /* the wrap-around below needs 1 <= T < N */
        printf(" 'T' SHOULD BE BETWEEN 1 AND N-1, TRY AGAIN \n");
        scanf("%d", &T);
    }

    printf(" DO YOU WANT 'WORST_CASE' TEST RESULTS? IF YES, ENTER");
    printf(" 1 \n");
    scanf("%d", &w);
    printf("w=%d\n", w);
    if (w == 1) {
        for (j = 1; j <= T; ++j) {
            for (k = 0; k <= N - 1; ++k) {
                l = k - j;              /* testing unit */
                if (l < 0)
                    l = l + N;
                if ((fault_array[k] == 'B') && (fault_array[l] == 'B'))
                    test_array[k][j] = 0;   /* faulty tester passes a faulty unit */
                else if ((fault_array[k] == 'B') || (fault_array[l] == 'B'))
                    test_array[k][j] = 1;   /* good tester fails a faulty unit, or
                                               faulty tester fails a good unit */
                else
                    test_array[k][j] = 0;   /* good tester passes a good unit */
            }
        }
    }
    else {  /* THIS PART: USER-DEFINED ARBITRARY TEST RESULTS */
        for (j = 1; j <= T; ++j) {
            for (k = 0; k <= N - 1; ++k) {
                l = k - j;
                if (l < 0)
                    l = l + N;
                if (fault_array[l] == 'B') {
                    printf("TEST RESULT NODE #%d BY NODE #%d IS ", k, l);
                    scanf("%d", &test_array[k][j]);
                    /* re-prompt until the outcome is 0 or 1 */
                    while (test_array[k][j] != 0 && test_array[k][j] != 1) {
                        printf("TEST RESULTS SHOULD BE 0 OR 1 \n");
                        scanf("%d", &test_array[k][j]);
                    }
                    printf("test_array[%d][%d]=%d\n", k, j, test_array[k][j]);
                }
                else if (fault_array[k] == 'B')
                    test_array[k][j] = 1;
                else
                    test_array[k][j] = 0;
            }
        }
    }
    for (k = 0; k <= N - 1; ++k) {      /* THIS PART PRINTS THE TEST_RESULT MATRIX */
        for (j = 1; j <= T; ++j)
            printf("  %d      ", test_array[k][j]);
        printf("\n\n");
    }
}   /* END OF TEST SUBROUTINE */

/*  THIS PART OF PROGRAM IS DRAWING THE NETWORK FOR DISPLAY  */
#include <device.h>
#include <gl.h>
#include <math.h>        /* declares cos() and sin() as returning double */
#define resetls TRUE

void drawing(void)
{
    int i, j, k, x, y, x1, y1, x2, y2, x3, y3, x4, y4, x5, y5;
    int x6, y6, x7, y7, x8, y8, t, r, R;
    char number[20], Z;
    float pi, theta, phi, rho, psi, tau;
    short ang;

    pi = 3.14159265;
    ginit();
    viewport(400, 1000, 100, 700);
    cursoff();
    color(BLUE);
    clear();
    linewidth(4);
    ortho2(-350.0, 350.0, -350.0, 350.0);
    R = 300;                 /* radius of the ring of units */
    r = 20;                  /* radius of one unit's circle */

    /* centre of the unit drawn at the top of the ring, and the end points
       of the X that marks a faulty unit */
    x = R * cos(pi / 2);
    y = R * sin(pi / 2);
    x3 = x + r * cos(5 * pi / 4);
    y3 = y + r * sin(5 * pi / 4);
    x4 = x + r * cos(pi / 4);
    y4 = y + r * sin(pi / 4);
    x5 = x + r * cos(7 * pi / 4);
    y5 = y + r * sin(7 * pi / 4);
    x6 = x + r * cos(3 * pi / 4);
    y6 = y + r * sin(3 * pi / 4);

    for (k = 0; k <= N - 1; ++k) {
        i = k + 1;
        if (i >= N)
            i = i - N;
        ang = (-3600.0 / N); /* rotate the ring one position (tenths of a degree) */
        rotate(ang, 'Z');
        while (getbutton(MOUSE3) != 1)
            ;
        /* disabled faulty unit: red circle with an X */
        if (fault_array[i] == 'B' && disable_array[i] == 'D') {
            color(RED);
            circfi(x, y, r);
            color(BLACK);
            move2i(x3, y3);
            draw2i(x4, y4);
            move2i(x5, y5);
            draw2i(x6, y6);
        }
        /* enabled faulty unit: green circle with an X */
        if (fault_array[i] == 'B' && disable_array[i] == 'E') {
            color(GREEN);
            circfi(x, y, r);
            color(BLACK);
            move2i(x3, y3);
            draw2i(x4, y4);
            move2i(x5, y5);
            draw2i(x6, y6);
        }
        /* enabled fault-free unit: green circle */
        if (fault_array[i] == 'G' && disable_array[i] == 'E') {
            color(GREEN);
            circfi(x, y, r);
        }
        /* disabled fault-free unit: red circle */
        if (fault_array[i] == 'G' && disable_array[i] == 'D') {
            color(RED);
            circfi(x, y, r);
        }
        color(WHITE);
        cmov2i(x + 30, y + 30);
        sprintf(number, "U%d", i);
        charstr(number);

        /* arrows from unit i to the units it tests: red = fail, green = pass */
        for (j = 1; j <= T; ++j) {
            l = i + j;       /* unit tested by unit i in column j */
            if (l >= N)
                l = l - N;
            if (test_array[l][j] == 1)
                color(RED);
            if (test_array[l][j] == 0)
                color(GREEN);
            theta = 2 * pi + (pi / 2) - (2 * pi / N) * j;
            phi = pi / 2 - (pi / N) * j;
            rho = pi / 2 - (2 * pi / N) * j;
            psi = (pi / N) * j;
            tau = pi / 6;
            x1 = r * sin(phi);
            y1 = R - r * cos(phi);
            x2 = R * cos(theta) - r * cos(phi - rho);
            y2 = R * sin(theta) + r * sin(phi - rho);
            x7 = x2 - r * sin(pi / 2 - psi - tau / 2);
            y7 = y2 + r * cos(pi / 2 - psi - tau / 2);
            x8 = x2 - r * cos(psi - tau / 2);
            y8 = y2 + r * sin(psi - tau / 2);
            move2i(x1, y1);
            draw2i(x2, y2);
            draw2i(x7, y7);
            move2i(x2, y2);
            draw2i(x8, y8);
        }
        while (getbutton(MOUSE1) != 1)
            ;
    }
    gexit();
}

/*  THIS PART DETERMINES THE INITIAL CONDITIONS AND DISPLAYS
    THE SYSTEM IN INITIAL CONDITIONS  */
void disable(void)
{
    printf("ENTER THE NUMBER OF ENABLED NODES\n");
    scanf("%d", &no_en_set);
    while (no_en_set < 0 || no_en_set > N) {    /* at most N units can be enabled */
        printf("THE NUMBER IS NOT VALID, TRY AGAIN !\n");
        scanf("%d", &no_en_set);
    }

    printf("ENTER THE MINIMUM NUMBER OF FAIL TEST RESULTS BY ");
    printf("ENABLED PROCESSORS WHICH DISABLE THE TESTED ");
    printf("PROCESSOR \n");
    scanf("%d", &dis_crit);

    for (i = 0; i <= N - 1; ++i)
        disable_array[i] = 'D';         /* every unit starts disabled */

    count = 0;
    j = 1;
    printf("ENTER THE ENABLED UNIT NUMBER ONE AT A TIME \n");
    while (count < no_en_set) {         /* repeat until all units are entered */
        scanf("%d", &i);
        if (i > N - 1 || i < 0) {
            printf("UNIT NUMBER IS NOT VALID, TRY AGAIN !\n");
        }
        else if (disable_array[i] == 'E') {
            printf("THIS UNIT IS PREVIOUSLY DEFINED AS ENABLED, ");
            printf("TRY AGAIN !\n\n");
        }
        else {
            disable_array[i] = 'E';
            printf("ENABLED UNIT #%d IS U%d\n\n", j, i);
            printf("disable_array[%d]=%c\n", i, disable_array[i]);
            ++j;
            ++count;                    /* one more enabled unit entered */
        }
    }
    drawing();
}


/*  THIS PART OF THE PROGRAM DETERMINES ENABLED AND DISABLED
    NODES AFTER THE ITERATION, DISPLAYS THE SYSTEM  */
void apply(void)
{
    for (k = 0; k <= N - 1; ++k) {
        count = 0;                      /* fail outcomes from enabled testers */
        for (j = 1; j <= T; ++j) {
            l = k - j;                  /* tester of unit k in column j */
            if (l < 0)
                l = l + N;
            if ((test_array[k][j] == 1) && (disable_array[l] == 'E'))
                ++count;
        }
        /* the new state goes to dis_res_array so that every unit is judged
           from the same (current) state */
        if (count >= dis_crit)
            dis_res_array[k] = 'D';
        else
            dis_res_array[k] = 'E';
    }

    for (k = 0; k <= N - 1; ++k)
        printf("dis_res_array[%d]=%c\n", k, dis_res_array[k]);
    for (k = 0; k <= N - 1; ++k)
        disable_array[k] = dis_res_array[k];
    drawing();
    printf("\n\n");
}
#include "gl.h"
#include <stdio.h>
#include <device.h>

int main(void)
{
    ginit();
    cursoff();
    color(WHITE);
    clear();
    textport(0, 350, 10, 900);
    linewidth(6);
    while (response != 6) {
        menu();
        if (scanf("%d", &response) != 1)
            break;                      /* stop on end of input or a non-number */
        if (response == 1)
            introduction();
        if (response == 2)
            sys_set_up();
        if (response == 3)
            test();
        if (response == 4)
            disable();
        if (response == 5)
            apply();
    }
    printf("    PROGRAM IS OVER \n");
    return 0;
}
APPENDIX B
HAND  CALCULATION  OF  DIFFERENT  CASES 

Case 1. A five unit multiprocessor system; U2 and U3 are faulty units and enabled
initially. Test results are worst case, and the disabling criterion is 1. In the matrix
below, each row gives a tested unit, the two units that test it, and the corresponding test
outcomes (0 = pass, 1 = fail).

Tested   Testers   Outcomes
U0       U4, U3    0, 1
U1       U0, U4    0, 0
U2       U1, U0    1, 1
U3       U2, U1    0, 1
U4       U3, U2    1, 1

a. first iteration with I.C
U1, U2, U3 are enabled
U0, U4 are disabled

b. second iteration
U1 is enabled
U0, U2, U3, U4 are disabled

c. third iteration
U0, U1, U4 are enabled
U2, U3 are disabled

* all faulty nodes are disabled, all fault-free nodes are enabled
Case 2. A five unit multiprocessor system in which U2 and U3 are faulty units and enabled
initially. Test results are the arbitrary (user defined) case, and the disabling criterion
is 1. The arbitrary outcomes are those produced by the faulty testers U2 and U3.

Tested   Testers   Outcomes
U0       U4, U3    0, 0
U1       U0, U4    0, 0
U2       U1, U0    1, 1
U3       U2, U1    0, 1
U4       U3, U2    0, 1

a. first iteration with I.C
U0, U1, U2, U3 are enabled
U4 is disabled

b. second iteration
U0, U1 are enabled
U2, U3, U4 are disabled

c. third iteration
U0, U1, U4 are enabled
U2, U3 are disabled

* all faulty nodes are disabled, all fault-free nodes are enabled.
Case 3. A five unit multiprocessor system; U1 and U4 are faulty units and enabled
initially. Test results are worst case, and the disabling criterion is 2.

Tested   Testers   Outcomes
U0       U4, U3    1, 0
U1       U0, U4    1, 0
U2       U1, U0    1, 0
U3       U2, U1    0, 1
U4       U3, U2    1, 1

a. first iteration with I.C
all nodes are enabled

b. second iteration
U0, U1, U2, U3 are enabled
U4 is disabled

* the system stays in that state forever,
* so the system is not 2-fault 2-correctable

Case 4. This system is the same as case 3; only the test results are the arbitrary case.

Tested   Testers   Outcomes
U0       U4, U3    0, 0
U1       U0, U4    1, 1
U2       U1, U0    1, 0
U3       U2, U1    0, 1
U4       U3, U2    1, 1

a. first iteration with I.C
all nodes are enabled

b. second iteration
U0, U2, U3 are enabled
U1, U4 are disabled
* all faulty nodes are disabled, all fault-free nodes are enabled.

Case 5. A seven unit multiprocessor system in which U1, U3 and U5 are faulty and enabled
initially. Test results are worst case, and the disabling criterion is 1.

Tested   Testers        Outcomes
U0       U6, U5, U4     0, 1, 0
U1       U0, U6, U5     1, 1, 0
U2       U1, U0, U6     1, 0, 0
U3       U2, U1, U0     1, 0, 1
U4       U3, U2, U1     1, 0, 1
U5       U4, U3, U2     1, 0, 1
U6       U5, U4, U3     1, 0, 1

a. first iteration with I.C
U1, U3, U5 are enabled
U0, U2, U4, U6 are disabled

* the system stays in that state forever, so the system is not 3-fault 1-correctable

Case 6. This system is the same as the previous case, but the disabling criterion is 2.

a. first iteration with I.C
U0, U1, U2, U3, U5 are enabled
U4, U6 are disabled

b. second iteration
U0, U1, U2, U5 are enabled
U3, U4, U6 are disabled

c. third iteration
U0, U1, U2, U4, U5, U6 are enabled
U3 is disabled

d. fourth iteration
U0, U2, U4, U6 are enabled
U1, U3, U5 are disabled

* all faulty nodes are disabled, all fault-free nodes are enabled.

Case 7. A seven unit multiprocessor system; U1, U3 and U5 are faulty units and enabled
initially. Test results are the arbitrary case, and the disabling criterion is 2.

Tested   Testers        Outcomes
U0       U6, U5, U4     0, 1, 0

a. first iteration with I.C
all nodes are enabled

b. second iteration
U0, U2, U4, U6 are enabled
U1, U3, U5 are disabled

* all faulty nodes are disabled, all fault-free nodes are enabled

Case 8. A seven unit multiprocessor system; U0, U1, U3 and U4 are faulty units, and U1, U3
and U4 are enabled initially. Test results are worst case, and the disabling criterion is 1.

Tested   Testers        Outcomes
U0       U6, U5, U4     1, 1, 0
U1       U0, U6, U5     0, 1, 1
U2       U1, U0, U6     1, 1, 0
U3       U2, U1, U0     1, 0, 0
U4       U3, U2, U1     0, 1, 0
U5       U4, U3, U2     1, 1, 0
U6       U5, U4, U3     0, 1, 1

a. first iteration with I.C
U0, U1, U3, U4 are enabled
U2, U5, U6 are disabled

* the system stays in that state forever, so it is not 4-fault 1-correctable.

Case 9. This case is the same as the previous case, except that the test results are the
arbitrary case.

a. first iteration with I.C
U1 is enabled
U0, U2, U3, U4, U5, U6 are disabled

b. second iteration
U0, U1, U4, U5, U6 are enabled
U2, U3 are disabled

c. third iteration
U4 is enabled
U0, U1, U2, U3, U5, U6 are disabled

d. fourth iteration
U1, U2, U3, U4 are enabled
U0, U5, U6 are disabled

e. fifth iteration
U1 is enabled and U0, U2, U3, U4, U5, U6 are disabled.

* This is the same state as after the first iteration, so the system cycles forever and is
not 4-fault 1-correctable.

Case 10. This system is the same as case 8, but the disabling criterion is 2 in this case.
The test results are the same as in case 8.

a. first iteration with I.C
U0, U1, U2, U3, U4 are enabled
U5, U6 are disabled.

b. second iteration
U0, U1, U3, U4 are enabled
U2, U5, U6 are disabled

c. third iteration
U0, U1, U3, U4 are enabled
U2, U5, U6 are disabled

* This is the same state as after the second iteration, so the system stays in it forever.
That means the system is not 4-fault 2-correctable.

Case 11. This is the same as case 9, but the disabling criterion is 2 in this case.

The test results are the same as in case 9.

a. first iteration
all nodes are enabled

b. second iteration
U2, U5, U6 are enabled
U0, U1, U3, U4 are disabled
* all faulty units are disabled, all fault-free units are enabled.


Case 12. An eight unit multiprocessor system; U0, U3, U5 and U7 are faulty units, and U0,
U3 and U5 are enabled initially. Test results are worst case, and the disabling criterion
is 1.

a. first iteration with I.C
U0, U3, U5, U7 are enabled
U1, U2, U4, U6 are disabled

* The system stays in that state forever, so it is not 4-fault 1-correctable.
Next, consider whether the system is 2-correctable. U0 will never be disabled in that case,
because it receives only one fail outcome (from U6), fewer than the two required. So the
system is not 2-correctable either.

Case 13. This is the same as case 12, but the test results are arbitrary and the disabling
criterion is 2.

a. first iteration with I.C

b. second iteration
U1, U2, U4 and U6 are enabled
U0, U3, U5, U7 are disabled

* all faulty units are disabled, all fault-free units are enabled.

Case 14. A nine-unit multiprocessor system in which U0, U1, U2, U3 are faulty and U0,
U2, U3 are enabled. Test results are worst case; the disabling criterion is 1.

a. first iteration with I.C

U0, U1, U2, U3, U8 are enabled

U4, U5, U6, U7 are disabled

b. second iteration

U8 is enabled

all the others are disabled

c. third iteration

U4, U5, U6, U7, U8 are enabled

U0, U1, U2, U3 are disabled

* All faulty nodes are disabled and all fault-free nodes are enabled.

Case 15. A nine-unit multiprocessor system in which U0, U3, U5, U8 are faulty and U3,
U5, U8 are enabled. Test results are arbitrary; the disabling criterion is 1.

a. first iteration with I.C

U1, U5, U8 are enabled

U0, U2, U3, U4, U6, U7 are disabled

b. second iteration

U1, U4, U8 are enabled

U0, U2, U3, U5, U6, U7 are disabled

c. third iteration

U1, U4, U6, U7 are enabled

U0, U2, U3, U5, U8 are disabled

d. fourth iteration

U1, U2, U4, U6, U7 are enabled

U0, U3, U5, U8 are disabled

*  All  faulty  nodes  are  disabled,  all  fault-free  nodes  are  enabled. 

Case 16. This is the same as the previous case, but the disabling criterion is 2.

a. first iteration

U0, U1, U2, U3, U4, U5, U6, U8 are enabled
U7 is disabled

b. second iteration

U1, U4, U6 are enabled

U0, U2, U3, U5, U7, U8 are disabled

c. third iteration

U0, U1, U2, U3, U4, U6, U7 are enabled
U5, U8 are disabled

d. fourth iteration

U1, U2, U4, U6, U7 are enabled

U0, U3, U5, U8 are disabled.

*  All  faulty  nodes  are  disabled,  all  fault-free  nodes  are  enabled. 

Case 17. This case is the same as Case 15, but the disabling criterion is 3.

a. first iteration

all nodes are enabled

b. second iteration

U1, U2, U4, U6, U7 are enabled

U0, U3, U5, U8 are disabled
*  All  faulty  nodes  are  disabled,  all  fault-free  nodes  are  enabled. 

Case 18. A nine-unit multiprocessor system in which U1, U3, U5, U8 are faulty and U1,
U3, U5 are enabled. Test results are worst case; the disabling criterion is 2.

a. first iteration with I.C

U0, U1, U2, U3, U5, U8 are enabled

U4, U6, U7 are disabled.

b. second iteration

U1, U5, U8 are enabled

U0, U2, U3, U4, U6, U7 are disabled.

c. third iteration

U1, U3, U4, U5, U6, U7, U8 are enabled
U0, U2 are disabled.

d. fourth iteration

U3, U5 are enabled.

U0, U1, U2, U4, U6, U7, U8 are disabled.

e. fifth iteration

U0, U1, U2, U3, U4, U5, U8 are enabled

U6, U7 are disabled.

f. sixth iteration

U1, U8 are enabled

U0, U2, U3, U4, U5, U6, U7 are disabled.
* The system is not 4-fault, 2-correctable.

Case 19. A six-unit multiprocessor system in which U1, U3, U5 are faulty and only U1
is disabled; all the other units are enabled. Test results are arbitrary; the disabling
criterion is 2.

a. first iteration

U0, U2, U3, U4 are enabled.

U1, U5 are disabled.

b. second iteration

U0, U1, U2, U3, U4 are enabled.

U5 is disabled.

c. third iteration

U0, U1, U2, U4 are enabled.

U3, U5 are disabled.

d. fourth iteration

U0, U1, U2, U4, U5 are enabled.

U3 is disabled.

e. fifth iteration

U0, U2, U4, U5 are enabled.
U1, U3 are disabled.

f. sixth iteration

U0, U2, U3, U4, U5 are enabled.

U1 is disabled.

* This is the I.C state: the system oscillates and returns to the I.C state every six iterations.




LIST  OF  REFERENCES 

1. J.H. Wensley et al., "SIFT: Design and analysis of a fault-tolerant computer for
aircraft control," Proc. of the IEEE, Vol. 66, No. 10, pp. 1240-1255, October 1978.

2. F.P. Preparata, G. Metze, and R.T. Chien, "On the connection assignment problem of
diagnosable systems," IEEE Trans. on Electronic Computers, Vol. EC-16, pp. 848-854, Dec. 1967.

3. K.Y. Chwa and S.L. Hakimi, "Schemes for fault-tolerant computing: A comparison of
modularly redundant and t-diagnosable systems," Inform. and Control, Vol. 49, No. 3,
pp. 212-238, June 1981.

4. A.D. Friedman and L. Simoncini, "System-level fault diagnosis," Computer, Vol. 13,
No. 3, pp. 47-53, March 1980.

5. S. Karunanithi and A.D. Friedman, "System diagnosis with t/s diagnosability,"
Proc. 7th Intl. Symp. on Fault-Tolerant Computing, pp. 65-71, June 1977.

6. M.L. Blount, "Probabilistic treatment of diagnosis in digital systems," Proc. 7th Intl.
Symp. on Fault-Tolerant Computing, pp. 72-77, June 1977.

7. J.T. Butler, "On the design of distributed diagnosable multiprocessing systems,"
research proposal, Naval Postgraduate School, Monterey, CA.

8. A.L. Hopkins, T.B. Smith, and J.H. Lala, "FTMP: A highly reliable fault-tolerant
multiprocessor for aircraft," Proc. of the IEEE, Vol. 66, No. 10, pp. 1221-1239,
October 1978.

9. R. Nair, G. Metze, and J. Abraham, "Design considerations for fault-tolerant
distributed digital systems," unpublished manuscript.

10. S. Mallela and G. Masson, "Diagnosable systems for intermittent faults," IEEE Trans.
on Comp., Vol. C-27, pp. 560-566, 1978.

11. S.G. Kochan, Programming in C, Hayden Book Company, 1983.




INITIAL  DISTRIBUTION  LIST 


No. Copies

1. Defense Technical Information Center    2
Cameron Station
Alexandria, VA 22304-6145

2. Library, Code 0142    2
Naval Postgraduate School
Monterey, CA 93943-5002

3. Department Chairman, Code 62    1
Department of Electrical and Computer Engineering
Naval Postgraduate School
Monterey, CA 93943-5000

4. Dr. Jon T. Butler, Code 62 BU    5
Department of Electrical and Computer Engineering
Naval Postgraduate School
Monterey, CA 93943-5000

5. Dr. Bruno O. Shubert, Code 55 SY    1
Department of Operational Analysis
Naval Postgraduate School
Monterey, CA 93943

6. Dr. Dana E. Madison, Code 52    1
Department of Computer Science
Naval Postgraduate School
Monterey, CA 93943

7. Dr. Joo Kang Lee    1
POSTECH Research Institute of Science and Technology
P.O. Box 125, Pohang City
Kyungbuk 680, Korea

8. Director of Research Administration, Code 012    1
Naval Postgraduate School
Monterey, CA 93943-5000

9. Kara Kuvvetleri Komutanligi    1
Egitim Dairesi Baskanligi
Bakanliklar, Ankara, Turkey

10. Kara Harp Okulu    1
Bakanliklar, Ankara, Turkey




11. Muhabere Okulu Komutanligi    1
Mamak, Ankara, Turkey

12. Capt. Ibrahim Dincer    1
Muhabere Okulu
Ogretim Kurulu
Mamak, Ankara, Turkey

13. Ltjg. Mustafa Paktuna    1
Marmara Cad. No: 158/6
Kocamustafapasa, Istanbul, Turkey

14. Dr. Andre van Tilborg    1
Code 1133
ONR
800 N. Quincy
Arlington, VA 22217

15. Dr. George Abraham    1
Code 7500
NRL
4555 Overlook Ave. S.W.
Washington, DC 20375

16.  Dr.  Lou  Schmid  1 
OHT  20T 

800  N.  Quincy  Ave.  Room  811 
Arlington,  VA  22217 





Thesis
D576485 Dincer
c.1 Fault diagnosis in distributed computer networks.


